Maps of Uncertinaty
The techniques in this section allow us to create maps of where our uncertinaty may be higher or lower. The most straight forward is a map that shows the standard deviation of our output results after a series of MC runs. The approach is to:
- Create a loop that creates a model and a model prediction with a random component. The random component could be cross-validation, noise injection, sensitivity testing, or some combination of all of these.
- After each model is created, do a prediction and compute the residuals.
- Save the residuals for further analysis.
- After the runs are completed, compute the standard deviation of the residuals.
- Use the standard deviation of the residuals to create an uncertinaty map.
This process can be used to create a wide variety of uncertianty maps.
Basic Method
Note that this section describes portions of the code. The full code is in the next section
The solution below will work for creating uncertianty maps in most situations until we use cross-validation and are using the original data for the prediction. The next section shows an approach for cross validation.
This approach uses a matrix to capture all the predicted values. Each time a model run is completed, the predicted values are added as a column to the matrix. After all runs are finished, the standard deviation of each row can be computed to create a vector that can be added back into a dataframe and then used to create an uncertianty map. This provides a map of the standard deviation of the predictors.
The first step is to include the 'matrixStats' library which includes a number of functions for matrixes including one to compute the standard deviation of the rows in a matrix.
# matrixStats includes a function to compute the std dev of a row require(matrixStats)
Next, create an empty matrix that has the same number of rows as the dataframe and the number of columns the same as the number of times you will repeat the process (e.g. Number of folks for k-fold validation).
# create an empty matrix PredictionMatrix=matrix(, nrow = NumRowsInDataFrame, ncol = 1)
Then, inside your loop, compute the residuals vector for all rows in your dataframe and add it to your residuals matrix.
# go through the residuals adding them into the residuals matrix i=1 NumRows=length(Residuals) while (i<=NumRows) { OriginalIndex=OriginalIndexes[i] # get the original index for this data point Value=unname(Residuals[i]) # get the Residual value without the name ResidualsMatrix[OriginalIndex,1]=Value # set the value into the residuals matrix i=i+1 }
Finally, after the loop, compute the standard deviations for the rows in the matrix. This will result in a vector that can be added to the dataframe and then used to create a map. Note that matrixStats also includes functions for computing means and other matrix values that can be used to evalute models.
# get the standard deviation for each row in the predicted values matrix PredictionStdDevs=rowSds(PredictionMatrix) # get the mean for each row in the predicted values matrix PredictionMeans=rowMeans(PredictionMatrix)
Creating Uncertianty Maps When Performing Cross-Validation and Predicting with the Original Data Set
There are times when we will have a continous response varaible and we are preidcting values on the same data set that we used to create the mdoel. These are rare but would include when we have a biomass or land cover data set, created and model of biomass, and then ran it in future senarios.
The approach below uses k-fold cross-validation to find the residuals and then adds the residuals back into our original data so we can use them to create an uncertinaty map. Using k-fold cross-validation gives us one value for each point in our sample. This function could be changed to use a 70/30 split or to do mulitple k-fold cross-validations or other approachs. The one problem I ran into with 70/30 split is that you can end up with points that do not have a residual or just have one making caclulation of a std dev of the residuals impossible.
# matrixStats includes a function to compute the std dev of a row require(matrixStats) # load the file with the field measurements for DougFir height TheData = read.csv("C:\\Users\\jim\\Desktop\\GSP 570\\Lab 07\\DougFir_All_CA_8km.csv") # Remove any NA (null) values, the data should already be loaded TheData=na.omit(TheData) # Setup the number of loops and number of folds for each loop NumFolds=10 # number of folds (loops) NumLoops=2 # number of loops (repeat k-fold validation) # create an empty matrix for the residuals # Each row is for one row in our data, each column is for the residuals # from one run through the loop below. OutputMatrix=matrix(, nrow = nrow(TheData), ncol = NumLoops) # add a column with ids for the original rows TheData$ID=seq(1:nrow(TheData)) # Repeat entire k-fold validcation LoopCount=1 while (LoopCount<=NumLoops) { # create the folds (buckets) for the data TheData2 <- TheData[sample(nrow(TheData)),] # shuffle the data TheFolds <- cut(seq(1,nrow(TheData2)),breaks=NumFolds,labels=FALSE) # divide the data into buckets FoldCount=1 while (FoldCount<=NumFolds) { # split the data into test and training testIndexes <- which(TheFolds==FoldCount) # get the indexes to the next fold in the data testData <- TheData2[testIndexes, ] # used the testindexes to sub-sample the data into a test dataset trainData <- TheData2[-testIndexes, ] # use the training indexes to sub-sample the data into a training dataset # Create a GAM based on the training data attach(trainData) TheModel=gam(Height~s(AnnualPrecip),data=trainData,gamma=1.4) detach("trainData") # create the preidction using the test data attach(testData) TestPrediction=predict.gam(TheModel,testData,type='response') detach("testData") # compute the output values (default to the prediction) OutputValues=TestPrediction # get the original indexes for the test data to update the residuals OriginalIndexes=TheData2$ID[testIndexes] # go through the residuals adding them into the residuals matrix i=1 NumRows=length(OutputValues) while (i<=NumRows) { OriginalIndex=OriginalIndexes[i] # get the original index for this data point Value=unname(OutputValues[i]) # get the Residual value OutputMatrix[OriginalIndex,LoopCount]=Value # set the value into the residuals matrix i=i+1 } FoldCount=FoldCount+1 } LoopCount=LoopCount+1 } OutputStdDev=rowSds(OutputMatrix) TheData$SD=OutputStdDev